Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support configurable delimiter for seed files, default to comma (#3990) #7186

Merged
merged 18 commits into from
Aug 1, 2023

Conversation

ramonvermeulen
Copy link
Contributor

@ramonvermeulen ramonvermeulen commented Mar 17, 2023

resolves #3990

Original description

Added support to configure a custom delimiter for seed files, defaults to comma.

Copilot description

This pull request introduces several changes related to seed files and configuration options. The most important changes include adding new test classes and methods to test different scenarios related to seed files with unique delimiters, adding a new configuration option for specifying a delimiter for seed files, and modifying various functions to support the new delimiter configuration option.

Summary of changes:

  • tests/adapter/dbt/tests/adapter/simple_seed/test_seed.py: Added a new test class and several test methods to test seed files with unique delimiters. [1] [2] [3]
  • core/dbt/contracts/graph/model_config.py: Added a new configuration option delimiter to the SeedConfig class.
  • tests/functional/artifacts/expected_manifest.py: Added a new configuration option delimiter to a function in expected_manifest.py.
  • tests/unit/test_contracts_graph_parsed.py: Added new configuration options delimiter to different functions. [1] [2] [3]

Other changes:

  • tests/adapter/dbt/tests/adapter/simple_seed/fixtures.py: Added a new model to the list of fixtures.
  • .changes/unreleased/Features-20230714-202445.yaml: Added support for configuring a delimiter for a seed file.
  • tests/functional/list/test_list.py: Added a new configuration option to a function.
  • core/dbt/context/providers.py: Added a new parameter to a function.
  • core/dbt/clients/agate_helper.py: Added a new parameter to a function.

Checklist

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ramonvermeulen! I think we'd want a functional test for this as well, guaranteeing that a CSV with a nonstandard (non-comma) delimiter actually works.

From my POV, this doesn't feel super high-priority — dbt's seed functionality is intentionally limited / narrow in scope, and I hesitate before adding more, given the added maintenance burden — but this is a very straightforward & self-contained change. TSVs are common enough in the wild.

core/dbt/contracts/graph/model_config.py Outdated Show resolved Hide resolved
@@ -135,12 +135,12 @@ def as_matrix(table):
return [r.values() for r in table.rows.values()]


def from_csv(abspath, text_columns):
def from_csv(abspath, text_columns, delimiter = ","):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Should we add support for more than just delimiter here? Any/all **kwargs supported by agate.from_csv / Python's stdlib CSV reader?

Thoughts:

  • That can be out of scope for now / for this PR
  • It would risk locking us further into using agate (which I don't love), or at least the Python stdlib csv reader (which I don't mind as much)

@jtcohen6 jtcohen6 added Team:Language ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering labels Mar 22, 2023
@ramonvermeulen
Copy link
Contributor Author

ramonvermeulen commented Mar 27, 2023

Thanks @ramonvermeulen! I think we'd want a functional test for this as well, guaranteeing that a CSV with a nonstandard (non-comma) delimiter actually works.

From my POV, this doesn't feel super high-priority — dbt's seed functionality is intentionally limited / narrow in scope, and I hesitate before adding more, given the added maintenance burden — but this is a very straightforward & self-contained change. TSVs are common enough in the wild.

Hi @jtcohen6, sorry for the late reply, I had a week of vacation.
I think adding a functional test is a good idea! I will dive into the functional test set-up and see if I can make a good functional test.

The reason I made this PR is because I encountered customers delivering there "csv" files with for example pipes or tabs as delimiter. Maybe because that is a standard within their organization. I thought it would be nice to have some more control about the configuration of the csv parser for the seed files, but I get your point about that it will increase the maintenance.

Question: Should we add support for more than just delimiter here? Any/all **kwargs supported by agate.from_csv / Python's stdlib CSV reader?

Thoughts:

  • That can be out of scope for now / for this PR
  • It would risk locking us further into using agate (which I don't love), or at least the Python stdlib csv reader (which I don't mind as much)

I find this a hard one, it is possible to add support for all the parameters of the agate.from_csv function, however this would lock dbt further in to using agate. The delimiter parameter, is an parameter that is also supported by the stdlib csv reader. I would say support for other parameters is out of scope for now.

Copy link
Contributor

@JCZuurmond JCZuurmond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

core/dbt/clients/agate_helper.py Outdated Show resolved Hide resolved
ramonvermeulen and others added 3 commits March 29, 2023 13:21
Co-authored-by: Cor <jczuurmond@protonmail.com>
Conflicts:
	test/unit/test_contracts_graph_parsed.py
@eliwoods
Copy link

eliwoods commented May 23, 2023

@ramonvermeulen I am rather interested in this PR, is there anything I can do to help get it over the line?

@ramonvermeulen
Copy link
Contributor Author

@ramonvermeulen I am rather interested in this PR, is there anything I can do to help get it over the line?

Hi @eliwoods, I think functional tests are still required to get this PR over the line @jtcohen6? I've been quite busy lately so feel free to pick this up / contribute on this as you wish!

@ramonvermeulen ramonvermeulen requested a review from a team as a code owner May 26, 2023 14:35
@ramonvermeulen ramonvermeulen requested review from emmyoop and VersusFacit and removed request for nssalian and a team May 26, 2023 14:35
@ramonvermeulen
Copy link
Contributor Author

ramonvermeulen commented Jun 23, 2023

@jtcohen6 @eliwoods @JCZuurmond I added a couple of functional tests for this feature, and fixed the failing seed related integration tests. Only one integration test called test_snapshot_hard_delete is failing on my local system. However this test is failing for me on the main branch as well, so I don't think it is related to this pull request.

Copy link

@thomas-vl thomas-vl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

Copy link
Contributor

@QMalcolm QMalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramonvermeulen Thank you for putting all this work it! I've looked it over and pulled it down for some manual testing. Every looks great ✨ Gonna run it through our CI workflow, and then we should be good to go 🙂

@QMalcolm
Copy link
Contributor

QMalcolm commented Jul 14, 2023

Eek, foiled by mypy. Perhaps I should have run the pre-commit hooks locally while reviewing 🙃

So there are two classes of mypy issues we're seeing. One is some simple formatting changes, which are simple fixes. The other is the errors like core/dbt/cli/main.py:403: error: <nothing> has no attribute "command" [attr-defined]. These errors are actually due to a recently released version of click, specifically click 8.1.4. We've fixed this in core, thus to fix this I'd recommend merging main into this branch to get it up to date.

@QMalcolm QMalcolm self-requested a review July 14, 2023 17:04
merge main into configurableDelimiterSeeds
@ramonvermeulen
Copy link
Contributor Author

ramonvermeulen commented Jul 14, 2023

Eek, foiled by mypy. Perhaps I should have run the pre-commit hooks locally while reviewing upside_down_face

So there are two classes of mypy issues we're seeing. One is some simple formatting changes, which are simple fixes. The other is the errors like core/dbt/cli/main.py:403: error: <nothing> has no attribute "command" [attr-defined]. These errors are actually due to a recently released version of click, specifically click 8.1.4. We've fixed this in core, thus to fix this I'd recommend merging main into this branch to get it up to date.

Thanks for taking a look at my PR! Just pushed a commit merging the upstream main back into my branch, and ran pre-commit, so hopefully that fixes the CI.

@QMalcolm
Copy link
Contributor

Thanks for taking a look at my PR! Just pushed a commit merging the upstream main back into my branch, and ran pre-commit, so hopefully that fixes the CI.

Absolutely! Thank you for the quick turn around with the fixes. Also, sorry that it took me so long to get to my initial review. It's been a busier few weeks for me than I anticipated 😅 I'm gonna kick off the CI again

Copy link
Contributor

@QMalcolm QMalcolm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving 🙂

@codecov
Copy link

codecov bot commented Aug 1, 2023

Codecov Report

Merging #7186 (847eb2d) into main (7872f6a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #7186   +/-   ##
=======================================
  Coverage   86.22%   86.22%           
=======================================
  Files         174      174           
  Lines       25515    25517    +2     
=======================================
+ Hits        22000    22002    +2     
  Misses       3515     3515           
Files Changed Coverage Δ
core/dbt/clients/agate_helper.py 96.52% <100.00%> (ø)
core/dbt/context/providers.py 88.66% <100.00%> (+0.01%) ⬆️
core/dbt/contracts/graph/model_config.py 93.81% <100.00%> (+0.01%) ⬆️

@QMalcolm QMalcolm merged commit 6130a6e into dbt-labs:main Aug 1, 2023
94 checks passed
@saviorand
Copy link

Installed the pre-release version and trying this in my dbt_project.yml, doesn't seem to work.. Not sure what's the issue:

seeds:
  <project_name>:
    <seed_directory_name>:
      +delimiter: ";"

@ramonvermeulen
Copy link
Contributor Author

Installed the pre-release version and trying this in my dbt_project.yml, doesn't seem to work.. Not sure what's the issue:

seeds:
  <project_name>:
    <seed_directory_name>:
      +delimiter: ";"

Hmmm, not sure what is going wrong. Currently I am leaving for holidays for a week, but will definitely look into this once I am back.

@ramonvermeulen
Copy link
Contributor Author

ramonvermeulen commented Aug 31, 2023

Installed the pre-release version and trying this in my dbt_project.yml, doesn't seem to work.. Not sure what's the issue:

seeds:
  <project_name>:
    <seed_directory_name>:
      +delimiter: ";"

@saviorand
I just tried to reproduce this bug using dbt-core and dbt-postgres version 1.7.0b1 but I am not able to.

I created a new dbt project using dbt init and created the following seed file:

a|b|c
1|2|3
4|5|6
7|8|9

On location seeds/test_file.csv with config:

seeds:
  <project_name>:
    +delimiter: "|"

Works without problems

On location seeds/sales/test_file.csv with config:

seeds:
  <project_name>:
    <seed_directory_name>
      +delimiter: "|"

Also works without problems

Could you provide more info in reproducing this issue, maybe I am missing something?

@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label Sep 25, 2023
@FishtownBuildBot
Copy link
Collaborator

Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#4126

@CarMoreno
Copy link

Great feature!!. Just one point: What about TSV files? In that case, the delimiter would be \t and won't work. I know it can be considered an edge case, I had to change my TSV file to use | as the delimiter character and my seed worked.

@ramonvermeulen
Copy link
Contributor Author

ramonvermeulen commented May 8, 2024

Great feature!!. Just one point: What about TSV files? In that case, the delimiter would be \t and won't work. I know it can be considered an edge case, I had to change my TSV file to use | as the delimiter character and my seed worked.

Good edge case actually that I didn't think about yet, I think as of now it works for any 1 character delimiter. I am currently on a longer break, but might look into this once I get back to see if I can add support for tab delimited files as well.

Small Update: when I take a look at the from_csv docs from agate I would expect that it would work with \t as well, not sure what the reason is that it does not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce "delimiter" parameter for seeds? Allow non-csv file types